Abstract: In data science, it is often required to estimate
dependencies between different data sources. These dependencies
are typically calculated using Pearson’s correlation,
distance correlation, and/or mutual information. However, none
of these measures satisfies all of Granger’s axioms for an
“ideal measure”. One such ideal measure, proposed by Granger 
himself, calculates the Bhattacharyya distance between the joint 
probability density function (pdf) and the product of marginal 
pdfs. We call this measure the mutual dependence. However, 
to date this measure has not been directly computable from 
data. In this paper, we use our recently introduced maximum
likelihood non-parametric estimator for band-limited pdfs to
compute the mutual dependence directly from the data. We
construct the estimator of mutual dependence and compare 
its performance to standard measures (Pearson’s and distance 
correlation) for different known pdfs by computing convergence 
rates, computational complexity, and the ability to capture 
nonlinear dependencies. Our mutual dependence estimator 
requires fewer samples to converge to theoretical values, is 
faster to compute, and captures more complex dependencies 
than standard measures. 

I. Introduction 

In data science and modeling, it is often required to
test whether two random variables are independent. Out of
several measures that quantify dependencies between random
variables [1]-[4], the most widely used are mutual information
I, Pearson’s correlation r, and distance correlation R.

Mutual information, I, is generally thought of as a benchmark
for quantifying dependencies between random variables;
however, it can only be computed by first estimating
the joint and marginal probability density functions (pdfs).
Pearson’s correlation, r, can be directly estimated from data,
but it does not capture nonlinear dependencies. Distance
correlation, R, can also be estimated directly from
data and can capture nonlinear dependencies, but is in general
slow to compute (computational complexity $O(n^2)$). Further,
distance correlation often does not reflect nonlinear
dependencies correctly in the sense of Rényi’s
axioms [5], which were slightly improved upon by Granger,
Maasoumi and Racine [6] (see Table I). Specifically, distance
correlation is not invariant under strictly monotonic
transformations (6th axiom in Table I).

An “ideal measure” should satisfy the axioms given in Table I
and should be directly estimable from the data. A less popular
and unnamed measure uses the Bhattacharyya distance
between the joint pdf and the product of the marginals as a
measure of dependence between two random variables [7],
[8]. It has been shown that this measure satisfies all six
axioms. Importantly, this measure is invariant under continuous
and strictly increasing transformations [6], [9]. It is also
closely related to mutual information, k-class entropy and
copula [10]-[12]. In this paper, we call this measure mutual
dependence, d.


Mutual dependence has not been widely used because,
like mutual information, it requires non-parametric density
estimation to compute the marginal and joint pdfs, which are
then substituted into the theoretical measure and numerically
integrated to yield estimates. This process is both computationally
complex and inaccurate. In this paper, we develop an
estimator that estimates mutual dependence directly from the
data. It uses our recently proposed Band-Limited Maximum
Likelihood (BLML) estimator, which maximizes the data likelihood
function over a set of band-limited pdfs with known
cut-off frequency $f_c$. The BLML estimator is consistent,
efficiently computable, and results in a smooth pdf [13]. The
BLML estimator also has a faster rate of convergence and
reduced computational complexity compared with other widely
used non-parametric methods such as kernel density estimators.
In addition, if the BLML estimator is
substituted into the expression for mutual dependence (see
(5)), the mutual dependence can be computed directly from
the data without performing numerical integration, which is
often inaccurate and inefficient.


We show through simulations that d converges faster than
R and r for various data sets with different types of linear
and nonlinear dependencies, and that the convergence rate of
d is maintained across the different types of nonlinearity.
d is also faster to compute than R, as it has $O(B^2 + n)$
time complexity, where B is the number of bins containing
a nonzero number of samples, which is always less than or equal
to n (the number of data samples).


The paper is organized as follows. Section II discusses
the variation of different measures as a function of mutual
information and nonlinearity. Section III introduces the notion
of mutual dependence and its estimator. Section IV uses
simulation to compare the convergence of mutual dependence
with that of Pearson’s and distance correlation for different
nonlinear dependencies and marginal pdfs. We end the paper
with conclusions and future work in Section V.




TABLE I

Desired properties of an ideal dependency measure δ(X, Y).

#   Property                                                                r I R d

1   δ(X, Y) = δ(Y, X)                                                       SYS
2   δ(X, Y) = 0 iff X and Y are independent                                 / /
3   0 ≤ δ(X, Y) ≤ 1                                                         /
4   δ(X, Y) = 1 if there is a strict dependence between X and Y             /
5   δ(X, Y) = f(r(X, Y)) if the joint distribution of X and Y is normal     / / /
6   δ(ψ1(X), ψ2(Y)) = δ(X, Y)                                               /

Note: ψ1, ψ2 are strictly monotonic functions.
Fig. 1. Point clouds. Point clouds of data generated from (1)
for different nonlinearities g(x) and generating pdfs. ρ = 0.9 was used for
generating this data.


II. A MOTIVATING EXAMPLE

Consider two random variables X and Y defined as:

$$
X = V, \qquad Y = \rho\, g(X) + \sqrt{1 - \rho^2}\, U, \tag{1}
$$

where U and V follow either a band-limited pdf

$$
f_X(x) \propto \mathrm{sinc}^4(0.2x - 0.1) + \mathrm{sinc}^4(0.2x + 0.1)
$$

or a normal pdf

$$
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}},
$$

and where g(X) is one of four types of (nonlinear) dependence:

$$
X, \quad X^2, \quad X^3, \quad \text{or} \quad \sin(X).
$$

The ‘spread’ ρ is varied from 0.1 to 0.9 to obtain different
degrees of dependence. Figure 1 illustrates the data generated
in this example.
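The data-generating model above can be sketched in a few lines. This is only an illustration under stated assumptions: U and V are taken to be independent standard normal draws (μ = 0, σ = 1), the band-limited generating pdf is not implemented, and the names `generate_pairs` and `NONLINEARITIES` are hypothetical.

```python
import numpy as np

# Nonlinearities g(X) listed in the text.
NONLINEARITIES = {
    "linear": lambda x: x,
    "quadratic": lambda x: x**2,
    "cubic": lambda x: x**3,
    "sinusoidal": np.sin,
}

def generate_pairs(n, rho, kind="linear", seed=0):
    """Draw n samples of (X, Y) with X = V and Y = rho*g(X) + sqrt(1 - rho^2)*U.

    Assumption: U and V are independent standard normal draws (mu = 0,
    sigma = 1); the band-limited generating pdf is not implemented here.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    u = rng.standard_normal(n)
    x = v
    y = rho * NONLINEARITIES[kind](x) + np.sqrt(1.0 - rho**2) * u
    return x, y

x, y = generate_pairs(n=10000, rho=0.9, kind="linear")
print(np.corrcoef(x, y)[0, 1])  # close to rho in the linear, normal case
```

In the linear case with unit-variance normal inputs, ρ is exactly the correlation coefficient of (X, Y), which is why the printed sample correlation lands near 0.9. For the other nonlinearities ρ only controls the balance between signal and noise.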

The goal of a dependency measure is to quantify the
dependence between X and Y given the data. When the
underlying pdfs are known, this dependence is captured
well by mutual information.

Therefore, in Figure 2 we plot the theoretical values of Pearson’s
and distance correlation as a function of
mutual information for the four different nonlinearity types
and the two different generating pdfs.


• Mutual information

$$
I = \iint \log\!\left( \frac{f_{XY}(x,y)}{f_X(x)\, f_Y(y)} \right) f_{XY}(x,y)\, dx\, dy \tag{2}
$$

• Pearson’s correlation

$$
r = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E(X)^2}\, \sqrt{E(Y^2) - E(Y)^2}} \tag{3}
$$

• Distance correlation

$$
\mathrm{dCov}^2(X,Y) = \iint \frac{\left| \phi_{XY}(s,t) - \phi_X(s)\, \phi_Y(t) \right|^2}{|s|^{1+p}\, |t|^{1+q}}\, ds\, dt, \qquad
R = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dCov}(X,X)\, \mathrm{dCov}(Y,Y)}} \tag{4}
$$

  Here $\phi_{XY}$, $\phi_X$, $\phi_Y$ are the respective characteristic
  functions, and p and q are the dimensions of X and Y. For
  details see [3]. (Note that we have eliminated the constants
  $c_p$, $c_q$ from the definition of dCov, as they are not needed
  to define R.)
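For reference, sample versions of (3) and (4) can be sketched as follows. The distance correlation below is the usual plug-in estimate built from double-centered pairwise distance matrices for scalar X and Y. It is given here as an illustration, not as the exact code used for the simulations, and the function names are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Sample version of (3): covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def _double_centered_distances(z):
    """Pairwise |z_i - z_j| with row, column and grand means removed."""
    d = np.abs(z[:, None] - z[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Plug-in sample distance correlation for scalar x and y.

    Builds two n-by-n distance matrices, hence O(n^2) time and memory.
    """
    a = _double_centered_distances(np.asarray(x, dtype=float))
    b = _double_centered_distances(np.asarray(y, dtype=float))
    dcov2_xy = (a * b).mean()
    dvar2_x = (a * a).mean()
    dvar2_y = (b * b).mean()
    return float(np.sqrt(max(dcov2_xy, 0.0) / np.sqrt(dvar2_x * dvar2_y)))

x = np.linspace(-1.0, 1.0, 201)
print(pearson_r(x, x**2))             # essentially zero: no linear trend
print(distance_correlation(x, x**2))  # clearly positive: dependence is detected
```

The quadratic example illustrates the point made in the introduction: Pearson’s correlation misses a purely nonlinear dependence that distance correlation detects, at the price of forming n-by-n matrices.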

Both Pearson’s and distance correlation depend
strongly on the type of nonlinearity for a given value of mutual
information. This variability may occur because neither
correlation measure is invariant under strictly monotonic
transformations, unlike mutual information. Therefore,
changing the type of nonlinearity results in different values
for both Pearson’s and distance correlation, while the mutual
information remains invariant. Such variability is undesirable,
as it may lead to incorrect inferences when comparing
dependencies between data sets having different types of nonlinear
dependence. Therefore, a measure that is invariant
under strictly monotonic transformations is desirable.


III. Mutual dependence and its estimation 

In this section, we introduce the mutual dependence, which
is based on an unnamed existing measure, and show several
properties of this measure. Then, we derive an estimator of
mutual dependence that is computed directly from data generated
from band-limited pdfs. Finally, we describe efficient algorithms
to compute this estimator.


A. Mutual dependence 

Consider two random variables X and Y, their joint distribution
$f_{XY}(x,y)$, and their marginal distributions $f_X(x)$
and $f_Y(y)$. These random variables are independent if and
only if $f_{XY}(x,y) = f_X(x)\, f_Y(y)$. It is therefore natural to
measure dependence as the distance (in the space of pdfs)
between the joint and the product of marginal distributions. A
good distance candidate is the Bhattacharyya distance (also
known as Hellinger distance). See [6], [9] for details.

Fig. 2. Pearson’s and distance correlation. Theoretical values
of r and R as a function of I for different nonlinearities and generating
pdfs as used in Figure 1.

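Definition 1: The mutual dependence d(X, Y) between
two random variables X and Y is defined as the Bhattacharyya
distance $d_h(\cdot,\cdot)$ between their joint distribution
$f_{XY}(x,y)$ and the product of their marginal distributions
$f_X(x)$ and $f_Y(y)$, that is,

$$
d(X, Y) \triangleq d_h\big( f_{XY}(x,y),\; f_X(x)\, f_Y(y) \big), \tag{5}
$$

where

$$
d_h^2\big( p(x), q(x) \big) \triangleq \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx. \tag{6}
$$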

We call this measure ‘mutual dependence’ as it represents 
mutual information most closely. For a given value of mutual 
information, the value of mutual dependence remains almost 
the same irrespective of the nonlinearity type, which is not 
true for Pearson’s and distance correlation measures. 


B. Properties of mutual dependence 

Due to the symmetry of $d_h(\cdot,\cdot)$, it is easy to see that d(X, Y) =
d(Y, X). The measure d ∈ (0, 1) if X and Y are partially dependent,
and its value quantifies the degree of dependence between
the two random variables. In the extreme cases, d = 0 ⟺
X and Y are independent, and d = 1 if either X or Y is
a Borel-measurable function of the other. Also, it can be
easily established that d is invariant under strictly monotonic
transformations $\psi_1$ and $\psi_2$, i.e., $d(X, Y) = d(\psi_1(X), \psi_2(Y))$.
A detailed description of these properties can be found in [6],
[9].

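Fig. 3. Mutual dependence. Theoretical values of d as a
function of I for different nonlinearities and generating pdfs as used in
Figure 1.

For jointly normal data, the mutual dependence can be
computed by first calculating the Bhattacharyya distance
between two multivariate Gaussian distributions [14]

$$
d_h^2 = 1 - \frac{|\Sigma_1|^{1/4}\, |\Sigma_2|^{1/4}}{\left| \tfrac{1}{2}\Sigma_1 + \tfrac{1}{2}\Sigma_2 \right|^{1/2}}
\exp\left\{ -\frac{1}{8} (\mu_1 - \mu_2)^T \left( \tfrac{1}{2}\Sigma_1 + \tfrac{1}{2}\Sigma_2 \right)^{-1} (\mu_1 - \mu_2) \right\}, \tag{7}
$$

where $\mu_1$ and $\mu_2$ are the mean vectors and $\Sigma_1$ and $\Sigma_2$
the covariance matrices. Then substituting

$$
\mu_1 = 0, \quad \mu_2 = 0, \quad
\Sigma_1 = \begin{bmatrix} \sigma_X^2 & \rho\, \sigma_X \sigma_Y \\ \rho\, \sigma_X \sigma_Y & \sigma_Y^2 \end{bmatrix}, \quad
\Sigma_2 = \begin{bmatrix} \sigma_X^2 & 0 \\ 0 & \sigma_Y^2 \end{bmatrix}
$$

gives

$$
d(X, Y) = \sqrt{1 - \frac{(1 - \rho^2)^{1/4}}{\left( 1 - \rho^2/4 \right)^{1/2}}} \triangleq f(\rho). \tag{8}
$$

This shows that mutual dependence satisfies axiom 5 (see
Table I).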
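As a sanity check, the closed form for jointly normal data can be compared with a direct numerical evaluation of definitions (5) and (6). The sketch below assumes that (8) reads d = sqrt(1 − (1 − ρ²)^{1/4} / (1 − ρ²/4)^{1/2}), which is what substituting the two covariance matrices into (7) gives, and it assumes unit variances. The function names are hypothetical.

```python
import numpy as np

def mutual_dependence_normal(rho):
    """Closed form for a bivariate normal with correlation rho (assumed form of (8))."""
    return np.sqrt(1.0 - (1.0 - rho**2) ** 0.25 / np.sqrt(1.0 - rho**2 / 4.0))

def mutual_dependence_grid(rho, half_width=8.0, points=801):
    """Evaluate (5)-(6) by a Riemann sum on a square grid (unit variances assumed)."""
    t = np.linspace(-half_width, half_width, points)
    dx = t[1] - t[0]
    x, y = np.meshgrid(t, t, indexing="ij")
    joint = np.exp(-(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2)))
    joint /= 2 * np.pi * np.sqrt(1 - rho**2)
    product = np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)
    half_sq = 0.5 * np.sum((np.sqrt(joint) - np.sqrt(product)) ** 2) * dx * dx
    return np.sqrt(half_sq)

for rho in (0.1, 0.5, 0.9):
    print(rho, mutual_dependence_normal(rho), mutual_dependence_grid(rho))
```

The two columns should agree closely for each ρ, and the closed form is zero at ρ = 0, as required by axiom 2.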

C. Estimation of mutual dependence 

To estimate d, we use the BLML method [13], which
maximizes the likelihood of observing the data samples over the
set of band-limited pdfs. The BLML estimator has been shown to
outperform kernel density estimators (KDE) in both convergence
rate and computational time, and hence provides a
better alternative for non-parametric estimation of pdfs. In
addition, the structure of the BLML estimator is well suited
for evaluating the integral in (5), resulting in an estimate
that is a direct function of the observed data and hence avoids
numerical integration errors.

Below we briefly describe the BLML estimator.

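Theorem 3.1: Consider n independent samples of an unknown
BL pdf, $f_X(x)$, with assumed cut-off frequency $f_c$.
Then the BLML estimator of $f_X(x)$ is given as:

$$
\hat{f}_X(x) = \left( \frac{1}{n} \sum_{i=1}^{n} \hat{c}_i\, \mathrm{sinc}_{f_c}(x - x_i) \right)^2, \tag{9}
$$

where $f_c \in \mathbb{R}^m$ is the assumed cut-off frequency, the vectors
$x_i$, with $i = 1, \cdots, n$, are the data samples,
$\mathrm{sinc}_{f_c}(x) \triangleq \prod_{k=1}^{m} \frac{\sin(\pi f_{ck} x_k)}{\pi x_k}$,
and the vector $\hat{c} = [\hat{c}_1, \cdots, \hat{c}_n]^T$ is given by

$$
\hat{c} = \arg\max_{\rho_n(c) = 0} \left( \prod_{i=1}^{n} \frac{1}{c_i^2} \right). \tag{10}
$$

Here $\rho_{ni}(c) \triangleq \sum_{j=1}^{n} c_j s_{ij} - \frac{n}{c_i}$ with $s_{ij} \triangleq \mathrm{sinc}_{f_c}(x_i - x_j)$.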
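A minimal one-dimensional sketch of this estimator is given below. It is an illustration under two loud assumptions and not the algorithm of [13]. First, the estimate is taken to be the squared sinc expansion with coefficients satisfying Σ_j c_j s_ij = n / c_i. Second, only the all-positive solution of these equations is searched, rather than the maximizer over all solutions. On that branch the equations are the stationarity conditions of the convex function ½ cᵀSc − n Σ log c_i, which the sketch minimizes with damped Newton steps. All function names are hypothetical.

```python
import numpy as np

def sinc_fc(u, fc):
    """sin(pi*fc*u) / (pi*u), equal to fc at u = 0 (np.sinc is the normalized sinc)."""
    return fc * np.sinc(fc * u)

def solve_positive_branch(S, tol=1e-8, max_iter=5000):
    """Find c > 0 with sum_j S[i, j] * c[j] = n / c[i] for every i.

    Assumption: only the all-positive sign pattern is searched. The damped
    Newton step length 1 / (1 + decrement) keeps every c[i] positive.
    """
    n = S.shape[0]
    c = np.full(n, np.sqrt(n / np.trace(S)))
    for _ in range(max_iter):
        grad = S @ c - n / c
        step = np.linalg.solve(S + np.diag(n / c**2), -grad)
        decrement = np.sqrt(max(-(grad @ step), 0.0))
        if decrement < tol:
            break
        c = c + step / (1.0 + decrement)
    return c

def blml_pdf(samples, fc):
    """Return (pdf, c, S): a callable for the squared sinc expansion, its
    coefficients, and the n-by-n sinc matrix."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    S = sinc_fc(samples[:, None] - samples[None, :], fc)
    c = solve_positive_branch(S)

    def pdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return (sinc_fc(x[:, None] - samples[None, :], fc) @ c / n) ** 2

    return pdf, c, S

rng = np.random.default_rng(0)
data = rng.standard_normal(200)
pdf, c, S = blml_pdf(data, fc=0.5)
print(pdf([0.0, 1.0, 2.0]))      # estimated density at three points
print(c @ S @ c / data.size**2)  # integral of the estimate, equal to 1 at a solution
```

The last line uses the fact that the integral of sinc_{f_c}(x − a) sinc_{f_c}(x − b) over x equals sinc_{f_c}(a − b), so the estimate integrates to cᵀSc / n², which the constraint forces to one.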

See [13] for details of the BLML estimator. We now introduce the
estimator of d, denoted $\hat{d}$, in the following theorem.

Theorem 3.2: Let $(x_i, y_i)$, $i = 1, \cdots, n$, be n paired
independent and identically distributed data observations, and let
$f_c$ be the cut-off frequency parameter. Then the estimator of
mutual dependence is given as:

$$
\hat{d} = d_h\big( \hat{f}_{XY},\; \hat{f}_X \hat{f}_Y \big)
= \sqrt{\, 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{\hat{c}^{(XY)}_i}{\hat{c}^{(X)}_i\, \hat{c}^{(Y)}_i} \,}, \tag{11}
$$

where $\hat{c}^{(XY)} = \{\hat{c}^{(XY)}_i\}_{i=1}^{n}$ is given by:

$$
\hat{c}^{(XY)} = \arg\max_{\rho^{(XY)}_n(c) = 0} \left( \prod_{i=1}^{n} \frac{1}{c_i^2} \right), \qquad
\rho^{(XY)}_{ni}(c) = \sum_{j=1}^{n} c_j\, \frac{\sin(\pi f_c (x_i - x_j))\, \sin(\pi f_c (y_i - y_j))}{\pi^2 (x_i - x_j)(y_i - y_j)} - \frac{n}{c_i}, \tag{12}
$$

$\hat{c}^{(X)} = \{\hat{c}^{(X)}_i\}_{i=1}^{n}$ is:

$$
\hat{c}^{(X)} = \arg\max_{\rho^{(X)}_n(c) = 0} \left( \prod_{i=1}^{n} \frac{1}{c_i^2} \right), \qquad
\rho^{(X)}_{ni}(c) = \sum_{j=1}^{n} c_j\, \frac{\sin(\pi f_c (x_i - x_j))}{\pi (x_i - x_j)} - \frac{n}{c_i}, \tag{13}
$$

and $\hat{c}^{(Y)} = \{\hat{c}^{(Y)}_i\}_{i=1}^{n}$ is:

$$
\hat{c}^{(Y)} = \arg\max_{\rho^{(Y)}_n(c) = 0} \left( \prod_{i=1}^{n} \frac{1}{c_i^2} \right), \qquad
\rho^{(Y)}_{ni}(c) = \sum_{j=1}^{n} c_j\, \frac{\sin(\pi f_c (y_i - y_j))}{\pi (y_i - y_j)} - \frac{n}{c_i}. \tag{14}
$$

Proof: The BLML estimators of $f_{XY}$, $f_X$ and $f_Y$ from
Theorem 3.1 (using the same cut-off frequencies $[f_c, f_c]$, $f_c$ and $f_c$,
respectively) are plugged into (5), and the resultant expression
is integrated, which gives $\hat{d}$. The integral has a closed form because
$\int \mathrm{sinc}_{f_c}(x - a)\, \mathrm{sinc}_{f_c}(x - b)\, dx = \mathrm{sinc}_{f_c}(a - b)$,
so the resulting sums collapse through the constraints $\rho_n(c) = 0$. ■


D. Computation of mutual dependence

As described in [13], solving for $\hat{c}$ exactly requires exponential
time. Therefore, heuristic algorithms also described in [13],
such as BLMLBQP and BLMLTrivial, can be used
to compute $\hat{c}^{(XY)}_i$, $\hat{c}^{(X)}_i$, $\hat{c}^{(Y)}_i$ approximately for each i for
small scale (n < 100) and large scale (n > 100) problems,
respectively.

To further improve the computational time, the BLMLQuick
algorithm [13] can also be used. BLMLQuick uses binning
and estimates $\hat{c}^{(XY)}_i$, $\hat{c}^{(X)}_i$, $\hat{c}^{(Y)}_i$ approximately for each
i. It is also shown in [13] that both the BLMLTrivial and
BLMLQuick algorithms yield consistent estimates of pdfs if
the true pdf is strictly positive; therefore, in cases where the
joint $f_{XY} > 0$, the estimate $\hat{d}$ is also consistent.

IV. Performance of mutual dependence

In this section, we evaluate the performance of our estimator
of mutual dependence by first comparing the empirical
distribution of the estimator with the empirical distributions
of the estimators of Pearson’s and distance correlation for
different mutual information values, I, nonlinearities, g(X),
and generating pdfs, $f_X(x)$. We then compare the convergence
of these estimators to the true values for different sample
sizes. Finally, we compare the computational complexity of
our estimator with that of the estimator of distance correlation to
evaluate the relative computational time needed to implement
each estimator.


A. Comparison of convergence rate for different nonlinearities

Figures 4 and 5 plot the estimates $\hat{r}$, $\hat{R}$ and $\hat{d}$ for n = 316
and n = 10000 from about 50 Monte Carlo runs as a function
of I for different nonlinearities (linear, quadratic, cubic and
sinusoidal) and generating pdfs (band-limited and normal).
Underlaid are the respective theoretical values. Specifically,
the first row shows about 50 Monte Carlo computations of $\hat{r}$
for different I values, nonlinearities and generating pdfs. It
can be seen that for both n = 316 and n = 10000, $\hat{r}$ works
best for linear and sinusoidal data, but for quadratic data $\hat{r}$
has a larger variance and for cubic data $\hat{r}$ has a larger bias
in the band-limited case. The second row shows 50 Monte Carlo
computations of $\hat{R}$ for different I values, nonlinearities and
generating pdfs. It can be seen that for both n = 316 and
n = 10000, $\hat{R}$ works best for linear data, but for quadratic
and sinusoidal data it has a larger bias, whereas for cubic data
it has a larger variance. The bottom row shows 50 Monte Carlo
computations of $\hat{d}$ for different I values, nonlinearities and
generating pdfs. It can be seen that $\hat{d}$ works equally well
for all nonlinearities and shows less bias and variance than
both $\hat{r}$ and $\hat{R}$.

Figure 6 plots the integrated (over different I values)
mean squared error (IMSE) between the theoretical and
estimated measures using about 50 Monte Carlo runs, for
different nonlinearities and generating pdf types:

$$
\mathrm{IMSE} = \int \frac{1}{m} \sum_{i=1}^{m} \big( \hat{\delta}_i(I) - \delta(I) \big)^2\, dI. \tag{15}
$$

Here, m is the number of Monte Carlo simulations and
δ is the dependency metric. It can be seen from Figure
6 that the convergence rate is fastest for $\hat{d}$ irrespective of
nonlinearity type and/or generating pdf. $\hat{r}$ and $\hat{R}$ show an
equally fast convergence rate for linear and normal data,
but the rate is slower for nonlinear and non-normal data.
Specifically, the first row shows the convergence of $\hat{r}$, from
which it can be established that the convergence of $\hat{r}$ to the
theoretical values is fastest for linear data. For nonlinear data,
the convergence is slow due to either large bias or large variance,
as discussed previously. The second row shows the convergence
of $\hat{R}$. It can be seen that $\hat{R}$ does well for linear data, but
the rate slows down and saturates for nonlinear data, again
due to either large bias or large variance. In particular, for cubic
and band-limited data, the IMSE of $\hat{R}$ does not decrease
with an increasing number of samples; this is due to the
nondecreasing variance of the estimator (see Figure 4). The
bottom row shows the convergence of $\hat{d}$. It can be seen that $\hat{d}$
converges equally well for all data types and generating pdfs.

Fig. 4. Monte Carlo estimates for band-limited generating pdfs. The Monte Carlo distribution of estimates of the different measures for different
nonlinearities and band-limited generating pdfs. ×'s mark the estimates calculated using sample size n = 316, whereas ○'s mark the estimates calculated
using sample size n = 10000. $\hat{d}$ is estimated assuming the cut-off frequency $f_c$ = _ x 2 .

Fig. 5. Monte Carlo estimates for normal generating pdfs. The Monte Carlo distribution of estimates of the different measures for different nonlinearities
and normal generating pdfs. ×'s mark the estimates calculated using sample size n = 316, whereas ○'s mark the estimates calculated using sample size
n = 10000. $\hat{d}$ is estimated assuming the cut-off frequency $f_c$ = 1 2 .

Fig. 6. Integrated mean squared error vs sample size. The
integrated mean squared error as a function of sample size n for different
measures, different nonlinearities and different generating pdfs. $\hat{d}$ is estimated
assuming the cut-off frequency $f_c$ = 1 ^ .
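The IMSE in (15) can be approximated from a table of Monte Carlo estimates as sketched below. The array layout, the trapezoidal rule for the integral over I, and the function name `imse` are assumptions made for illustration. The placeholder curve is synthetic and is not taken from the figures.

```python
import numpy as np

def imse(estimates, theory, info_grid):
    """Approximate (15): average squared error over runs, integrated over I.

    estimates: array of shape (m, K), one row per Monte Carlo run
    theory:    array of shape (K,), theoretical values of the measure
    info_grid: array of shape (K,), increasing mutual information values
    Assumption: the integral over I is approximated by the trapezoidal rule.
    """
    estimates = np.asarray(estimates, dtype=float)
    theory = np.asarray(theory, dtype=float)
    info_grid = np.asarray(info_grid, dtype=float)
    mse = np.mean((estimates - theory[None, :]) ** 2, axis=0)
    return float(np.sum(0.5 * (mse[1:] + mse[:-1]) * np.diff(info_grid)))

info_grid = np.linspace(0.0, 2.0, 5)
theory = np.sqrt(info_grid / 2.0)                      # placeholder curve, not from the paper
estimates = theory[None, :] + 0.1 * np.ones((50, 5))   # constant error of 0.1 in every run
print(imse(estimates, theory, info_grid))              # 0.1**2 * (2.0 - 0.0) = 0.02 up to rounding
```

Plotting this quantity against the sample size n for each measure gives curves of the kind described for Figure 6.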

B. Comparison of computational time 

The computational complexity of computing $\hat{r}$ is the lowest,
$O(n)$, whereas the computational complexity of computing
$\hat{R}$ is the highest, $O(n^2)$. The complexity of computing $\hat{d}$ is the same as
that of the BLMLQuick algorithm,
$O(B^2 + n)$, where B is the number of bins containing
a nonzero number of samples, which is always less than or equal
to n. For dense data, where B is much smaller than n, the computation of $\hat{d}$ is therefore a
lot quicker than that of $\hat{R}$.
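To show where the cost of the estimator sits, the sketch below computes $\hat{d}$ for scalar X and Y. It rests on several assumptions and is not the BLMLQuick algorithm of [13]. It assumes the estimator has the form $\hat{d}^2 = 1 - \frac{1}{n}\sum_i \hat{c}^{(XY)}_i / (\hat{c}^{(X)}_i \hat{c}^{(Y)}_i)$. It searches only the all-positive solution of the constraint equations. It also uses a dense n-by-n Newton solve with no binning, so it does not reach the $O(B^2 + n)$ cost. All function names are hypothetical.

```python
import numpy as np

def sinc_fc(u, fc):
    """sin(pi*fc*u) / (pi*u), equal to fc at u = 0."""
    return fc * np.sinc(fc * u)

def positive_coefficients(S, tol=1e-8, max_iter=5000):
    """Solve sum_j S[i, j] * c[j] = n / c[i] with every c[i] > 0 (assumed branch)."""
    n = S.shape[0]
    c = np.full(n, np.sqrt(n / np.trace(S)))
    for _ in range(max_iter):
        grad = S @ c - n / c
        step = np.linalg.solve(S + np.diag(n / c**2), -grad)
        decrement = np.sqrt(max(-(grad @ step), 0.0))
        if decrement < tol:
            break
        c = c + step / (1.0 + decrement)  # damped Newton step, keeps c positive
    return c

def mutual_dependence_hat(x, y, fc):
    """Plug-in value of d-hat from the three coefficient vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    Sx = sinc_fc(x[:, None] - x[None, :], fc)
    Sy = sinc_fc(y[:, None] - y[None, :], fc)
    cx = positive_coefficients(Sx)
    cy = positive_coefficients(Sy)
    cxy = positive_coefficients(Sx * Sy)   # product kernel for the joint pdf
    overlap = np.mean(cxy / (cx * cy))     # O(n) once the coefficients are known
    return float(np.sqrt(max(1.0 - overlap, 0.0)))

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
print(mutual_dependence_hat(x, rng.standard_normal(200), fc=0.5))  # independent pair
print(mutual_dependence_hat(x, x, fc=0.5))                          # fully dependent pair
```

Once the three coefficient vectors are available, the final combination is a single pass over the samples. The dominant cost is therefore solving for the coefficients, which is the step that binning is meant to reduce from n unknowns to B.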

V. Conclusions 

In this paper, we introduced a novel estimator for measuring
dependency that can be directly computed from the
data. Our estimator computes the mutual dependence, which
is an “ideal” measure of dependence between two random
variables [6]. Our estimator has an advantage over mutual
information estimators, as it does not require estimating
the pdfs from data. It also has an advantage over Pearson’s
and distance correlation estimators, as it is invariant under
strictly monotonic transformations. Further, we showed that,
under simulation, the estimators of both Pearson’s and distance
correlation require more samples to achieve the same integrated
mean squared error (IMSE) as our
mutual dependence estimator, that is, they show a slower convergence
rate. The slower convergence rate of the estimators of
Pearson’s and distance correlation was due to their higher
variance and bias for nonlinearly dependent data. Such
nonlinearities did not affect our estimator, which showed
a uniform decrease in IMSE as the sample size increased
for all tested nonlinearities. Furthermore, our estimate of
mutual dependence has a computational time complexity
of $O(B^2 + n)$, where B ≤ n is the number of bins, which
is superior to the time complexity of distance correlation
($O(n^2)$) and is much faster when the data is dense.

A. Future work 

Although our estimator of the mutual dependence showed
some desirable properties under simulation, it remains to be
established that it is consistent for any nonlinearity,
which would require a theoretical proof. Further,
in this paper, we assumed throughout that we knew the cut-off
frequency of the band-limited pdf, or an approximate cut-off
frequency for the normal pdf (the band where most of the
power of the pdf lies, in case it is not band-limited). However,
in general this cut-off frequency is not known. A more in-depth
analysis is needed to understand the behavior of our
estimator as a function of the cut-off frequency.